
Fix for unmarshalling error for keepequal/dropequal #2794

Merged

Conversation

@rashmichandrashekar (Contributor) commented Mar 27, 2024

Description:
Fixes a bug where the scrape config returned by the targetAllocator crashes the otel MetricsReceiver when a relabel or metric relabel config uses action: keepequal or action: dropequal.

Link to tracking Issue(s):
#2793

Testing:
Tested that it doesn't crash the MetricsReceiver.

@rashmichandrashekar requested review from a team March 27, 2024 00:35

// Load the JSON-serialized scrape configs into a generic map keyed by job name.
var jobToScrapeConfig map[string]interface{}
err = json.Unmarshal(jsonConfig, &jobToScrapeConfig)
if jobToScrapeConfig != nil {

Contributor

Instead of doing this scary looking surgery, we should probably do something similar to what prometheus-operator does, and manually build the scrapeconfig out of appropriate yaml structs: https://github.com/prometheus-operator/prometheus-operator/blob/4e3b2bcea44cbfdbd5c4975d777240285721dc6b/pkg/prometheus/promcfg.go#L1645/
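For reference, a minimal sketch of that style (illustrative only, not the operator's actual code): the scrape config is built by hand from gopkg.in/yaml.v2 MapSlice values, so serialization never depends on the Prometheus types' own marshalers. The job name and label names below are made up.

```go
package main

import (
	"fmt"

	yaml "gopkg.in/yaml.v2"
)

func main() {
	// Build one relabel rule as an ordered list of key/value pairs.
	relabeling := yaml.MapSlice{
		{Key: "source_labels", Value: []string{"__meta_kubernetes_pod_annotation_prometheus_io_port"}},
		{Key: "target_label", Value: "__meta_kubernetes_pod_container_port_number"},
		{Key: "action", Value: "keepequal"},
	}

	// Assemble the scrape config the same way, field by field.
	scrapeConfig := yaml.MapSlice{
		{Key: "job_name", Value: "example"},
		{Key: "relabel_configs", Value: []yaml.MapSlice{relabeling}},
	}

	out, err := yaml.Marshal(scrapeConfig)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```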

Contributor Author

Thanks @swiatekm-sumo. But that would mean that whenever the scrape config changes upstream, we would need to update this manually as well, which seems tedious. Also, I don't see any libraries I can reuse. Until this is fixed in Prometheus, could this be considered as a fix? The Marshal implementation simplifies the conversion, and we can rely on the interface implementation to reliably convert to yaml. Once the upstream fix lands, we can just remove this.
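A rough, self-contained sketch of that idea (not the PR diff; the label names are made up): let the Prometheus types' own yaml marshalers do the conversion, then read the result back into a generic map instead of hand-building structures.

```go
package main

import (
	"fmt"

	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/model/relabel"
	yaml "gopkg.in/yaml.v2"
)

func main() {
	// Start from the package defaults so optional fields keep sensible values.
	rc := relabel.DefaultRelabelConfig
	rc.SourceLabels = model.LabelNames{"__meta_kubernetes_pod_annotation_prometheus_io_port"}
	rc.TargetLabel = "__meta_kubernetes_pod_container_port_number"
	rc.Action = relabel.KeepEqual

	// Serialize via the MarshalYAML implementations defined on the Prometheus types.
	raw, err := yaml.Marshal(&rc)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(raw))

	// Load the result into a generic map that can be post-processed or served,
	// without duplicating Prometheus' config structs in this repository.
	var generic map[string]interface{}
	if err := yaml.Unmarshal(raw, &generic); err != nil {
		panic(err)
	}
	fmt.Println("action:", generic["action"])
}
```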

Contributor Author

@swiatekm-sumo - Following up. Please let me know your thoughts. Thanks!

Contributor

I understand that your approach fixes this specific problem, but it still leaves us uncertain whether we have other, similar problems that we have yet to encounter. I'm not against merging this specific fix as-is, but I'd like to look for a more systemic solution, where we can be assured our serialized scrape configs are correct. @jaronoff97 @pavolloffay wdyt?

Contributor Author

Thanks @swiatekm-sumo. As part of this fix, I also tested all the relabel actions available in the Prometheus configuration. These two were the only actions that cause problems, and since this is tied to a specific bug, it is unlikely to recur. If other marshalling issues do come up, they would also affect all Prometheus users who depend on it, and the ideal fix should come from Prometheus's own marshalling implementation. We should also be able to reliably depend on the native Prometheus marshalling/unmarshalling implementation without duplicating effort. Please let me know what you think. Thanks again!

Contributor

Are we able to write unit tests checking this?

Contributor Author

Sure @swiatekm-sumo. I have added tests checking that unmarshalling doesn't result in errors for any supported Prometheus relabel action. Please take a look. Thank you!
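For the record, a standalone sketch in the same spirit as the tests described here (not the PR's actual test file): every supported relabel action is unmarshalled through Prometheus' relabel.Config and must not return an error. The YAML snippets are illustrative.

```go
package sketch

import (
	"testing"

	"github.com/prometheus/prometheus/model/relabel"
	"github.com/stretchr/testify/require"
	yaml "gopkg.in/yaml.v2"
)

func TestRelabelActionsUnmarshal(t *testing.T) {
	// One minimal, valid config per relabel action supported by Prometheus.
	cases := map[relabel.Action]string{
		relabel.Replace:   "source_labels: [__name__]\ntarget_label: name\naction: replace\n",
		relabel.Keep:      "source_labels: [__name__]\nregex: up\naction: keep\n",
		relabel.Drop:      "source_labels: [__name__]\nregex: up\naction: drop\n",
		relabel.HashMod:   "source_labels: [__name__]\nmodulus: 8\ntarget_label: __tmp_hash\naction: hashmod\n",
		relabel.LabelMap:  "regex: __meta_(.+)\naction: labelmap\n",
		relabel.LabelDrop: "regex: __tmp_.+\naction: labeldrop\n",
		relabel.LabelKeep: "regex: job|instance\naction: labelkeep\n",
		relabel.Lowercase: "source_labels: [__name__]\ntarget_label: name\naction: lowercase\n",
		relabel.Uppercase: "source_labels: [__name__]\ntarget_label: name\naction: uppercase\n",
		relabel.KeepEqual: "source_labels: [__name__]\ntarget_label: __name__\naction: keepequal\n",
		relabel.DropEqual: "source_labels: [__name__]\ntarget_label: __name__\naction: dropequal\n",
	}

	for action, doc := range cases {
		var cfg relabel.Config
		require.NoErrorf(t, yaml.Unmarshal([]byte(doc), &cfg), "action %s should unmarshal cleanly", action)
	}
}
```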

@swiatekm swiatekm (Contributor) left a comment

I think I'm OK with this fix, although I'd like to open an issue to find a better solution as well. But I'd like @pavolloffay and @jaronoff97 to give their opinions too.

@swiatekm swiatekm (Contributor) left a comment

Please fix the linter error, otherwise this looks good to me.

@swiatekm swiatekm (Contributor) left a comment

Thanks for fixing this, and being patient with my feedback. As I said earlier, I consider this a temporary fix until we can figure out a more robust way of dealing with the problem. However, fixing the bug is more important than other considerations here in my view.

@jaronoff97 (Contributor)

Could you add a comment to the function signature explaining why we need to do this, linking this issue prometheus/prometheus#12534

@rashmichandrashekar (Contributor Author)

> Thanks for fixing this, and being patient with my feedback. As I said earlier, I consider this a temporary fix until we can figure out a more robust way of dealing with the problem. However, fixing the bug is more important than other considerations here in my view.

Thanks @swiatekm-sumo. Your feedback makes sense to me.

@rashmichandrashekar (Contributor Author)

> Could you add a comment to the function signature explaining why we need to do this, linking this issue prometheus/prometheus#12534

Thanks @jaronoff97. Added the comment. Please take a look.
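Purely as an illustration of the kind of comment being asked for (the function name, signature, and body below are placeholders, not the actual server.go code):

```go
package sketch

import (
	promconfig "github.com/prometheus/prometheus/config"
	yaml "gopkg.in/yaml.v2"
)

// marshalScrapeConfigs serializes the scrape configs via the Prometheus yaml
// marshalers, because scrape configs that use the keepequal/dropequal relabel
// actions otherwise fail to unmarshal on the collector side.
// See prometheus/prometheus#12534; once that is fixed upstream, this
// workaround can be removed.
func marshalScrapeConfigs(configs map[string]*promconfig.ScrapeConfig) ([]byte, error) {
	return yaml.Marshal(configs)
}
```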

@rashmichandrashekar (Contributor Author)

@jaronoff97 - I have committed your suggestion. Please merge if all looks good. Thanks again!

@jaronoff97 (Contributor) commented Apr 10, 2024

@rashmichandrashekar before I merge, did you confirm this worked by running the TA locally and the scrape_configs response worked with the collector? If not, it may be helpful to add an e2e here...

@rashmichandrashekar (Contributor Author)

> @rashmichandrashekar before I merge, did you confirm this worked by running the TA locally and the scrape_configs response worked with the collector? If not, it may be helpful to add an e2e here...

@jaronoff97 - Yes, I did test out that this works by running TA and collector locally :)

@jaronoff97 jaronoff97 merged commit e107ffe into open-telemetry:main Apr 11, 2024
31 checks passed
ItielOlenick pushed a commit to ItielOlenick/opentelemetry-operator that referenced this pull request May 1, 2024
rubenvp8510 pushed a commit to rubenvp8510/opentelemetry-operator that referenced this pull request May 7, 2024